{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center/>转换数据集为MindRecord"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 概述\n",
    "\n",
    "用户可以将非标准的数据集和常用的数据集转换为MindSpore数据格式，即MindRecord，从而方便地加载到MindSpore中进行训练。同时，MindSpore在部分场景做了性能优化，使用MindRecord数据格式可以获得更好的性能体验。\n",
    "\n",
    "MindSpore数据格式具备的特征如下：\n",
    "- 实现多变的用户数据统一存储、访问，训练数据读取更加简便。\n",
    "- 数据聚合存储，高效读取，且方便管理、移动。\n",
    "- 高效的数据编解码操作，对用户透明、无感知。\n",
    "- 可以灵活控制分区的大小，实现分布式训练。\n",
    "\n",
    "MindSpore数据格式的目标是归一化用户的数据集，并进一步通过`MindDataset`实现数据的读取，并用于训练过程。\n",
    "\n",
    "![data_conversion_concept](https://gitee.com/mindspore/docs/raw/master/tutorials/training/source_zh_cn/advanced_use/images/data_conversion_concept.png)\n",
    "\n",
    "> 本文档适用于CPU、GPU和Ascend环境。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 基本概念"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一个MindRecord文件由数据文件和索引文件组成，且数据文件及索引文件暂不支持重命名操作：\n",
    "\n",
    "- 数据文件\n",
    "\n",
    "    包含文件头、标量数据页、块数据页，用于存储用户归一化后的训练数据，且单个MindRecord文件建议小于20G，用户可将大数据集进行分片存储为多个MindRecord文件。\n",
    "\n",
    "\n",
    "- 索引文件\n",
    "\n",
    "    包含基于标量数据（如图像Label、图像文件名等）生成的索引信息，用于方便的检索、统计数据集信息。\n",
    "\n",
    "![mindrecord](https://gitee.com/mindspore/docs/raw/master/tutorials/training/source_zh_cn/advanced_use/images/mindrecord.png)\n",
    "\n",
    "数据文件主要由以下几个关键部分组成：\n",
    "\n",
    "- 文件头\n",
    "    \n",
    "    文件头主要用来存储文件头大小、标量数据页大小、块数据页大小、Schema信息、索引字段、统计信息、文件分区信息、标量数据与块数据对应关系等，是MindRecord文件的元信息。\n",
    "\n",
    "\n",
    "- 标量数据页\n",
    "    \n",
    "    标量数据页主要用来存储整型、字符串、浮点型数据，如图像的Label、图像的文件名、图像的长宽等信息，即适合用标量来存储的信息会保存在这里。\n",
    "\n",
    "\n",
    "- 块数据页\n",
    "    \n",
    "    块数据页主要用来存储二进制串、Numpy数组等数据，如二进制图像文件本身、文本转换成的字典等。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 整体流程\n",
    "\n",
    "1. 准备环节。\n",
    "2. 将数据集转换为MindRecord。\n",
    "3. 读取MindRecord数据集。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 准备环节"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 创建目录"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下载需要处理的图片数据`tansform.jpg`作为待处理的原始数据。  \n",
    "创建文件夹目录`./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/`用于存放本次体验中所有的转换数据集。  \n",
    "创建文件夹目录`./datasets/convert_dataset_to_mindrecord/images/`用于存放下载下来的图片数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2020-12-01 16:55:47--  https://gitee.com/mindspore/docs/raw/master/tutorials/notebook/convert_dataset_to_mindrecord/datasets/convert_dataset_to_mindrecord/images/transform.jpg\n",
      "Resolving proxy-notebook.modelarts-dev-proxy.com (proxy-notebook.modelarts-dev-proxy.com)... 192.168.0.172\n",
      "Connecting to proxy-notebook.modelarts-dev-proxy.com (proxy-notebook.modelarts-dev-proxy.com)|192.168.0.172|:8083... connected.\n",
      "Proxy request sent, awaiting response... 200 OK\n",
      "Length: unspecified [image/jpeg]\n",
      "Saving to: ‘transform.jpg’\n",
      "\n",
      "transform.jpg           [ <=>                ]  82.15K  --.-KB/s    in 0.1s    \n",
      "\n",
      "Last-modified header missing -- time-stamps turned off.\n",
      "2020-12-01 16:55:47 (719 KB/s) - ‘transform.jpg’ saved [84126]\n",
      "\n",
      "./datasets/convert_dataset_to_mindrecord/images/\n",
      "└── transform.jpg\n",
      "\n",
      "0 directories, 1 file\n"
     ]
    }
   ],
   "source": [
    "!wget -N https://gitee.com/mindspore/docs/raw/master/tutorials/notebook/convert_dataset_to_mindrecord/datasets/convert_dataset_to_mindrecord/images/transform.jpg\n",
    "!mkdir -p ./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/\n",
    "!mkdir -p ./datasets/convert_dataset_to_mindrecord/images/\n",
    "!mv -f ./transform.jpg ./datasets/convert_dataset_to_mindrecord/images/\n",
    "!tree ./datasets/convert_dataset_to_mindrecord/images/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 将数据集转换为MindRecord"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将数据集转换为MindRecord主要分为以下5个步骤：\n",
    "\n",
    "1. 导入`FileWriter`类，用于将用户定义的原始数据写入，参数用法如下：\n",
    "\n",
    "    - `file_name` - MindSpore数据格式文件的文件名，本例使用变量`data_record_path`传入该参数。\n",
    "    - `shard_num` - MindSpore数据格式文件的数量，默认为1，取值范围在[1,1000]，本例使用4。\n",
    "\n",
    "2. 定义数据集Schema，Schema用于定义数据集包含哪些字段以及字段的类型，然后添加Schema，相关规范如下：\n",
    "\n",
    "    - 字段名：字母、数字、下划线。\n",
    "    - 字段属性`type`：int32、int64、float32、float64、string、bytes。\n",
    "    - 字段属性`shape`：如果是一维数组，用[-1]表示，如果是二维数组，用[m,n]表示，如果是三维数组，用[x,y,z]表示。\n",
    "\n",
    "    > - 如果字段有属性`shape`,则对应数据类型必须为int32、int64、float32、float64。\n",
    "    > - 如果字段有属性`shape`，则用户传入`write_raw_data`接口的数据必须为`numpy.ndarray`类型。\n",
    "\n",
    "    本例中定义了`file_name`字段，用于标注准备写入数据的文件名字，定义了`label`字段，用于给数据打标签，定义了`data`字段，用于保存数据。\n",
    "3. 准备需要写入的数据，按照用户定义的Schema形式，准备需要写入的样本列表。\n",
    "4. 添加索引字段，添加索引字段可以加速数据读取，改步骤为可选操作。\n",
    "5. 写入数据，最后生成MindSpore数据格式文件。接口说明如下：\n",
    "\n",
    "    - `write_raw_data`：将数据写入到内存之中。\n",
    "    - `commit`：将最终内存中的数据写入到磁盘。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MSRStatus.SUCCESS"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from mindspore.mindrecord import FileWriter\n",
    "import os \n",
    "\n",
    "# clean up old run files before in Linux\n",
    "data_path = './datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/'\n",
    "os.system('rm -f {}test.*'.format(data_path))\n",
    "\n",
    "# import FileWriter class ready to write data\n",
    "data_record_path = './datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/test.mindrecord'\n",
    "writer = FileWriter(file_name=data_record_path,shard_num=4)\n",
    "\n",
    "# define the data type\n",
    "data_schema = {\"file_name\":{\"type\":\"string\"},\"label\":{\"type\":\"int32\"},\"data\":{\"type\":\"bytes\"}}\n",
    "writer.add_schema(data_schema,\"test_schema\")\n",
    "\n",
    "# prepeare the data contents\n",
    "file_name = \"./datasets/convert_dataset_to_mindrecord/images/transform.jpg\"\n",
    "with open(file_name, \"rb\") as f:\n",
    "    bytes_data = f.read()\n",
    "data = [{\"file_name\":\"transform.jpg\", \"label\":1, \"data\":bytes_data}]\n",
    "\n",
    "# add index field\n",
    "indexes = [\"file_name\",\"label\"]\n",
    "writer.add_index(indexes)\n",
    "\n",
    "# save data to the files\n",
    "writer.write_raw_data(data)\n",
    "writer.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "该示例会生成8个文件，成为MindRecord数据集。`test.mindrecord0`和`test.mindrecord0.db`称为1个MindRecord文件，其中`test.mindrecord0`为数据文件，`test.mindrecord0.db`为索引文件，生成的文件如下所示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/\n",
      "├── test.mindrecord0\n",
      "├── test.mindrecord0.db\n",
      "├── test.mindrecord1\n",
      "├── test.mindrecord1.db\n",
      "├── test.mindrecord2\n",
      "├── test.mindrecord2.db\n",
      "├── test.mindrecord3\n",
      "└── test.mindrecord3.db\n",
      "\n",
      "0 directories, 8 files\n"
     ]
    }
   ],
   "source": [
    "!tree ./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6. 如果需要在现有数据格式文件中增加新数据，可以调用`open_for_append`接口打开已存在的数据文件，继续调用`write_raw_data`接口写入新数据，最后调用`commit`接口生成本地数据文件。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MSRStatus.SUCCESS"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "writer = FileWriter.open_for_append('./datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/test.mindrecord0')\n",
    "writer.write_raw_data(data)\n",
    "writer.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 读取MindRecord数据集"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面将简单演示如何通过`MindDataset`读取MindRecord数据集。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 导入读取类`MindDataset`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mindspore.dataset as ds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 首先使用`MindDataset`读取MindRecord数据集，然后对数据创建了字典迭代器，并通过迭代器读取了一条数据记录。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'data': array([255, 216, 255, ..., 159, 255, 217], dtype=uint8), 'file_name': array(b'transform.jpg', dtype='|S13'), 'label': array(1, dtype=int32)}\n"
     ]
    }
   ],
   "source": [
    "file_name = './datasets/convert_dataset_to_mindrecord/datas_to_mindrecord/test.mindrecord0'\n",
    "# create MindDataset for reading data\n",
    "define_data_set = ds.MindDataset(dataset_file=file_name)\n",
    "# create a dictionary iterator and read a data record through the iterator\n",
    "print(next(define_data_set.create_dict_iterator(output_numpy=True)))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "MindSpore-1.0.1",
   "language": "python",
   "name": "mindspore-1.0.1"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
